---
title: "Assignment 1: Estimating CDF and PDF"
author: "Samuel Johnson"
header-includes:
  - \usepackage{amssymb}
date: " Due: 2/3/2026"
output:
  html_document: 
    toc: yes
    toc_depth: 4
    toc_float: yes
    number_sections: no
    toc_collapsed: yes
    code_folding: hide
    code_download: yes
    smooth_scroll: yes
    theme: lumen
  pdf_document: 
    toc: yes
    toc_depth: 4
    fig_caption: yes
    number_sections: yes
    fig_width: 3
    fig_height: 3
  word_document: 
    toc: yes
    toc_depth: 4
    fig_caption: yes
    keep_md: yes
editor_options: 
  chunk_output_type: inline
---

```{css, echo = FALSE}
#TOC::before {
  content: "Table of Contents";
  font-weight: bold;
  font-size: 1.2em;
  display: block;
  color: navy;
  margin-bottom: 10px;
}


div#TOC li {     /* table of content  */
    list-style:upper-roman;
    background-image:none;
    background-repeat:no-repeat;
    background-position:0;
}

h1.title {    /* level 1 header of title  */
  font-size: 22px;
  font-weight: bold;
  color: DarkRed;
  text-align: center;
  font-family: "Gill Sans", sans-serif;
}

h4.author { /* Header 4 - and the author and data headers use this too  */
  font-size: 15px;
  font-weight: bold;
  font-family: system-ui;
  color: navy;
  text-align: center;
}

h4.date { /* Header 4 - and the author and data headers use this too  */
  font-size: 18px;
  font-weight: bold;
  font-family: "Gill Sans", sans-serif;
  color: DarkBlue;
  text-align: center;
}

h1 { /* Header 1 - and the author and data headers use this too  */
    font-size: 20px;
    font-weight: bold;
    font-family: "Times New Roman", Times, serif;
    color: darkred;
    text-align: center;
}

h2 { /* Header 2 - and the author and data headers use this too  */
    font-size: 18px;
    font-weight: bold;
    font-family: "Times New Roman", Times, serif;
    color: navy;
    text-align: left;
}

h3 { /* Header 3 - and the author and data headers use this too  */
    font-size: 16px;
    font-weight: bold;
    font-family: "Times New Roman", Times, serif;
    color: navy;
    text-align: left;
}

h4 { /* Header 4 - and the author and data headers use this too  */
    font-size: 14px;
  font-weight: bold;
    font-family: "Times New Roman", Times, serif;
    color: darkred;
    text-align: left;
}

/* Add dots after numbered headers */
.header-section-number::after {
  content: ".";
}

body { background-color:white; }

.highlightme { background-color:yellow; }

p { background-color:white; }
```

```{r setup, include=FALSE}
# code chunk specifies whether the R code, warnings, and output 
# will be included in the output files.
if (!require("knitr")) {
   install.packages("knitr")
   library(knitr)
}
if (!require("pander")) {
   install.packages("pander")
   library(pander)
}
if (!require("ggplot2")) {
  install.packages("ggplot2")
  library(ggplot2)
}
if (!require("tidyverse")) {
  install.packages("tidyverse")
  library(tidyverse)
}

if (!require("plotly")) {
  install.packages("plotly")
  library(plotly)
}
####
knitr::opts_chunk$set(echo = TRUE,        # include the code chunk in the output file
                      warning = FALSE,    # suppress warning messages produced by the code
                      results = "markup", # include the code output in the output file
                      message = FALSE,    # suppress package start-up and other messages
                      comment = NA        # do not prefix output lines with "##"
                      )
```
 
 \
 
## **Assignment Objectives** 

* Develop a clear technical understanding of nonparametric cumulative distribution function (CDF) estimation and various kernel density estimators.

* Translate mathematical formulas into R functions and apply them to solve related problems.

* Create effective visualizations to demonstrate your understanding of key concepts in the following questions.



\

## **Question 1: Cumulative Distribution Function (CDF) Estimation**

The following failure times (in hours) were observed for 8 electronic components:

<center> 23, 45, 67, 89, 112, 156, 189, 245  </center>

a) Write an R function implementing the ECDF $\hat{F}_n(t)$ according to its mathematical definition. Validate your implementation using R's ecdf() function on the given data, with comparison based on their step functions.

We are given the following definition for $\hat{F}_n(t)$:

$$ \hat{F}_n(t) = \frac{1}{n} \sum_{i=1}^{n} \mathbb{I} \left(t_i \leq t \right)$$
where $t_1, \dots, t_n$ are the observed failure times and

$$ \mathbb{I}(t_i \leq t) = 
  \begin{cases}
      1 & t_i \leq t\\
      0 & t_i > t
  \end{cases} 
$$
```{r}
random_sample <- c(23, 45, 67, 89, 112, 156, 189, 245)

# Build the ECDF of the sample x and return it as a function of t.
sams_ecdf <- function(x) {
  n <- length(x)
  function(t) {
    # mat[i, j] is TRUE when x[i] <= t[j]
    mat <- outer(x, t, FUN = "<=")
    # proportion of observations at or below each element of t
    colSums(mat) / n
  }
}

# Check the custom function against stats::ecdf() on a grid covering the data
r_lang_ecdf <- stats::ecdf(random_sample)
sam_ecdf <- sams_ecdf(random_sample)

test_points <- seq(20, 250, 1)
results <- r_lang_ecdf(test_points) == sam_ecdf(test_points)
print(paste("fails: ", sum(!results)))
print(paste("passes: ", sum(results)))
```
```{r}
x_points <- seq(20, 250, 0.5)
plot <- ggplot(data=data.frame(x=x_points, y_1=sam_ecdf(x_points), y_2=r_lang_ecdf(x_points))) +
  geom_point(aes(x=x, y=y_1, color="Sam's ECDF")) + 
  geom_point(aes(x=x, y=y_2, color="R's ECDF"))

ggplotly(plot)
```


b) A colleague claims that the probability of failure before 100 hours is 0.5 based on these data. Do you agree? Explain your reasoning using the empirical cumulative distribution function (ECDF).

```{r}
random_sample <- c(23, 45, 67, 89, 112, 156, 189, 245)
r_ecdf <- stats::ecdf(random_sample)
print(r_ecdf(100))
```
Here we evaluate the ECDF at 100 hours, $\hat{F}_n(100)$, which is the proportion of observed failure times that are less than or equal to 100. Four of the eight observations (23, 45, 67, 89) are at or below 100, so $\hat{F}_n(100) = 4/8 = 0.5$, which matches the colleague's claim. The ECDF estimates the true CDF, that is, the probability that a failure occurs at or before 100 hours. I therefore agree that, based on these data, 0.5 is a reasonable estimate of the probability of failure before 100 hours, although an estimate from only 8 observations carries considerable sampling uncertainty.
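
The count behind this value can be checked directly from the indicator definition, without calling `ecdf()`. The following is a minimal sketch; the variable names `p_at_or_before` and `p_strictly_before` are introduced here only for illustration.

```{r}
random_sample <- c(23, 45, 67, 89, 112, 156, 189, 245)

# proportion of failure times at or before 100 hours (the ECDF at t = 100)
p_at_or_before <- mean(random_sample <= 100)

# proportion of failure times strictly before 100 hours
p_strictly_before <- mean(random_sample < 100)

print(c(at_or_before = p_at_or_before, strictly_before = p_strictly_before))
```

The two proportions coincide for these data because no component failed at exactly 100 hours, so the distinction between "before" and "at or before" does not change the answer.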


## **Question 2: Density Function Estimation**

Consider the following failure times from a mechanical system:

<center> 12.3, 14.7, 15.2, 16.8, 18.1, 19.4, 20.6, 22.3, 23.9, 25.4 </center>

a) Create a histogram of the data using 3 equally spaced bins. What is the estimated density in each bin? Describe the shape of the histogram's distribution.

```{r}
random_sample <- c(12.3, 14.7, 15.2, 16.8, 18.1, 19.4, 20.6, 22.3, 23.9, 25.4)
d_frame <- data.frame(x=random_sample)
plot <- ggplot(data=d_frame, aes(x = x)) +
  geom_histogram(aes(y = after_stat(density)),bins=3)
ggplotly(plot)
```
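
The question also asks for the estimated density in each bin. The sketch below computes it with base R's `hist()` and `plot = FALSE`, assuming the three bins are meant to span the observed range from 12.3 to 25.4. That is an assumption: `geom_histogram(bins = 3)` chooses its own bin edges, so its bar heights may not match these values. The density in a bin is its count divided by $n$ times the bin width.

```{r}
random_sample <- c(12.3, 14.7, 15.2, 16.8, 18.1, 19.4, 20.6, 22.3, 23.9, 25.4)

# three equally spaced bins spanning the observed range (assumed bin layout)
breaks <- seq(min(random_sample), max(random_sample), length.out = 4)
h <- hist(random_sample, breaks = breaks, plot = FALSE)

# density = count / (n * bin width), so the bar areas sum to 1
bin_width <- diff(breaks)
bin_density <- h$counts / (length(random_sample) * bin_width)

print(data.frame(lower = head(breaks, -1),
                 upper = tail(breaks, -1),
                 count = h$counts,
                 density = bin_density))
```

With these bin edges the counts are 3, 4, and 3, so the middle bin has the highest estimated density and the two outer bins are equal.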


b) Write an R function that computes kernel density estimates using a Gaussian kernel with $h=2$. Validate your implementation against R's built-in density() function.

$$
\hat{f}_h(t) = \frac{1}{nh}\sum_{i=1}^n K\left( \frac{t-t_i}{h}\right), \ \ \text{ where } \ \ K(u) = \frac{1}{\sqrt{2\pi}} e^{-u^2/2}.
$$
```{r}
# Return the kernel density estimate of `sample` as a function of t,
# using kernel K and bandwidth h.
kernel_density <- function(sample, K = dnorm, h = 2) {
  n <- length(sample)
  function(t) {
    # k[j, i] = K((t[j] - sample[i]) / h)
    k <- K(outer(t, sample, FUN = "-") / h)
    # sum over the observations for each t, then scale by 1 / (n h)
    rowSums(k) / (n * h)
  }
}

x <- c(12.3, 14.7, 15.2, 16.8, 18.1, 19.4, 20.6, 22.3, 23.9, 25.4)
my_gaussian_kernel_density <- kernel_density(x, K = dnorm, h = 2)

# R's estimate is drawn as a line; the custom estimate is overlaid as points
plot(density(x, kernel = "gaussian", bw = 2))
t_grid <- seq(-10, 50, 0.1)
points(x = t_grid, y = my_gaussian_kernel_density(t_grid))
```
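
The visual overlay can be complemented with a numeric check. The sketch below evaluates the formula above at the grid points returned by `density()` and reports the largest absolute difference. The helper name `gaussian_kde` is hypothetical and is defined here only so that the chunk is self-contained. A small nonzero difference is expected, because `density()` approximates the estimate on a regular grid using binning and a fast Fourier transform rather than evaluating the sum exactly.

```{r}
x <- c(12.3, 14.7, 15.2, 16.8, 18.1, 19.4, 20.6, 22.3, 23.9, 25.4)
h <- 2

# direct evaluation of the Gaussian kernel density formula (hypothetical helper)
gaussian_kde <- function(t, sample, h) {
  rowSums(dnorm(outer(t, sample, FUN = "-") / h)) / (length(sample) * h)
}

# R's built-in estimate with the same kernel and bandwidth
r_density <- density(x, kernel = "gaussian", bw = h)

# largest pointwise discrepancy on R's own evaluation grid
max_abs_diff <- max(abs(gaussian_kde(r_density$x, x, h) - r_density$y))
print(max_abs_diff)
```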
c) Write a custom R function that computes kernel density estimates using the Epanechnikov kernel with $h=2$. Validate your implementation by comparing results with R's built-in density() function for Gaussian kernel estimation.

$$
\hat{f}_h(t) = \frac{1}{nh}\sum_{i=1}^n K\left( \frac{t-t_i}{h}\right), \ \ \text{ where } \ \ K(u) = \frac{3}{4}(1 - u^2) \ \ \text{ for } \ \ |u| \le 1.
$$

```{r}

x <- c(12.3, 14.7, 15.2, 16.8, 18.1, 19.4, 20.6, 22.3, 23.9, 25.4)

# Epanechnikov kernel: 3/4 (1 - u^2) for |u| <= 1, zero elsewhere
K <- function(u) {
    mask <- abs(u) <= 1
    (3 / 4) * (1 - u^2) * mask
}

kernel_density <- function(sample, K = dnorm) {
  n <- length(sample)
  function(t, h = 2) {
    # each row holds the terms to sum for one element of t
    k <- K(outer(t, sample, FUN = "-") / h)
    # sum over the observations for each value of t
    s <- rowSums(k)
    # scale by 1 / (n h) and return
    s / h / n
  }
}

my_epanechnikov_kernel_density <- kernel_density(x, K = K)
x_axis <- seq(5, 30, 0.01)

# line: R's Gaussian estimate; points: custom Epanechnikov estimate
plot(density(x, kernel = "gaussian", bw = 2), main = "Gaussian and Epanechnikov kernels")
points(x = x_axis, y = my_epanechnikov_kernel_density(x_axis), pch = 20, cex = 0.1)
```
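
The custom Epanechnikov estimate can also be checked against `density()` with `kernel = "epanechnikov"`, provided the bandwidths are put on the same scale. `density()` scales every kernel so that `bw` is the standard deviation of the kernel, while $h$ in the formula above is the half-width of the kernel's support. The Epanechnikov kernel on $[-1, 1]$ has variance $1/5$, so $h = 2$ corresponds to `bw = 2 / sqrt(5)`. The sketch below uses that conversion; the helper name `epanechnikov_kde` is hypothetical.

```{r}
x <- c(12.3, 14.7, 15.2, 16.8, 18.1, 19.4, 20.6, 22.3, 23.9, 25.4)
h <- 2

# direct evaluation of the Epanechnikov kernel density formula (hypothetical helper)
epanechnikov_kde <- function(t, sample, h) {
  u <- outer(t, sample, FUN = "-") / h
  rowSums((3 / 4) * (1 - u^2) * (abs(u) <= 1)) / (length(sample) * h)
}

# density() expects the kernel's standard deviation, which is h / sqrt(5)
r_density <- density(x, kernel = "epanechnikov", bw = h / sqrt(5))

# largest pointwise discrepancy on R's own evaluation grid
max_abs_diff <- max(abs(epanechnikov_kde(r_density$x, x, h) - r_density$y))
print(max_abs_diff)
```

Passing `bw = 2` directly would instead give a kernel with half-width $2\sqrt{5} \approx 4.47$, which is a much smoother estimate than the one defined by $h = 2$.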



d) How does the choice of kernel (Gaussian vs. Epanechnikov) affect the density estimate? For both kernel estimators applied to this dataset, what happens when we select $h=1.5$ versus $h=2.5$?

```{r}
x <- c(12.3, 14.7, 15.2, 16.8, 18.1, 19.4, 20.6, 22.3, 23.9, 25.4)
plot <- ggplot(data=data.frame(x=x), aes(x=x)) +
  geom_density(bw=1.5, kernel="gaussian", aes(color="Gaussian h=1.5")) +
  geom_density(bw=1.5, kernel="epanechnikov", aes(color="Epanechnikov h=1.5")) +
  geom_density(bw=2.5, kernel="gaussian", aes(color="Gaussian h=2.5")) +
  geom_density(bw=2.5, kernel="epanechnikov", aes(color="Epanechnikov h=2.5")) +
  labs(title="Impact of Bandwidth and Kernel on Density Approximation")

ggplotly(plot)

```
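
One way to put numbers on what the plot shows is to compare the peak height of each estimate, since a smaller bandwidth produces a rougher curve with taller peaks and a larger bandwidth produces a flatter, smoother one. The sketch below uses `density()`, which `geom_density()` relies on. In both functions `bw` is the standard deviation of the kernel, so the two kernels are compared at a matched amount of smoothing. The `settings` data frame and its `peak` column are names introduced only for this illustration.

```{r}
x <- c(12.3, 14.7, 15.2, 16.8, 18.1, 19.4, 20.6, 22.3, 23.9, 25.4)

# every combination of kernel and bandwidth shown in the plot
settings <- expand.grid(kernel = c("gaussian", "epanechnikov"),
                        bw = c(1.5, 2.5),
                        stringsAsFactors = FALSE)

# peak height of each density estimate
settings$peak <- unname(mapply(
  function(k, b) max(density(x, kernel = k, bw = b)$y),
  settings$kernel, settings$bw
))

print(settings)
```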




